[57] ABSTRACT

The present invention provides a method and apparatus for 
using input data to optimize a computer program. Initially, 
the computer program is divided into one or more logical 
units of code. Next, a CPU simulator is used to simulate 
execution of each logical unit using the input data. The 
output from the simulation is used to generate a first opti- 
mization metric value and corresponding state information 
for each logical unit. In one embodiment, the first
optimization metric value and corresponding state information
are stored in a first optimization vector. Using well known
optimization techniques, the instructions within each logical
unit are optimized iteratively until additional optimizations
would result in very small incremental performance
improvements. A second simulation is performed using the
same input data except that this time the optimized logical
units are used. This second simulation is used to measure
how much the optimizer has improved the code. The output 
from the second simulation is used to generate a second 
optimization metric value and corresponding state informa- 
tion. The degree of optimization is determined by determin- 
ing the difference between the first optimization metric value 
and the second optimization metric value for the sum of the 
logical units. If the difference is less than a predetermined
threshold value, additional optimization iterations would
provide little code improvement and thus the optimization is
complete. However, if the difference is greater than or equal 
to the predetermined threshold value, additional optimiza- 
tions would likely improve performance. In the latter case, 
the present invention would repeat the optimization process 
described above. 
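
The simulate, optimize, re-simulate loop summarized above can be sketched as follows. This is a minimal illustration only, not the patented implementation: the names `optimize_program`, `toy_simulate` and `remove_one_nop`, the `max_rounds` guard, and the choice to keep a candidate only when it improves the metric are all assumptions made for the example. The toy simulator simply charges one cycle per instruction.

```python
from typing import Callable, Dict, List

Units = Dict[str, List[str]]  # hypothetical: logical unit name -> instruction list


def optimize_program(
    units: Units,
    simulate: Callable[[Units], Dict[str, float]],
    optimize_unit: Callable[[List[str]], List[str]],
    threshold: float,
    max_rounds: int = 10,
) -> Units:
    """Simulate, optimize, re-simulate; stop once the gain drops below threshold."""
    first = sum(simulate(units).values())  # first metric, summed over all units
    for _ in range(max_rounds):
        candidate = {name: optimize_unit(code) for name, code in units.items()}
        second = sum(simulate(candidate).values())  # same input data, optimized units
        gain = first - second
        if gain > 0:
            units, first = candidate, second  # keep the improved units
        if gain < threshold:
            break  # further iterations would add little
    return units


# Toy stand-ins (assumptions, not from the source): the "simulator" charges one
# cycle per instruction, and the "optimizer" deletes a single nop per pass.
def toy_simulate(units: Units) -> Dict[str, float]:
    return {name: float(len(code)) for name, code in units.items()}


def remove_one_nop(code: List[str]) -> List[str]:
    out = list(code)
    if "nop" in out:
        out.remove("nop")
    return out
```

Calling `optimize_program(program, toy_simulate, remove_one_nop, threshold=1.0)` keeps iterating while each round saves at least one simulated cycle, and stops on the first round whose gain falls below the threshold.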

20 Claims, 4 Drawing Sheets 
METHOD AND APPARATUS FOR
DYNAMICALLY OPTIMIZING AN 
EXECUTABLE COMPUTER PROGRAM 
USING INPUT DATA 

RELATED APPLICATIONS

This application is related to U.S. application Ser. No.
08/865,335, filed May 28, 1997, entitled "METHOD AND
APPARATUS FOR CONVERTING EXECUTABLE COMPUTER
PROGRAMS IN A HETEROGENEOUS COMPUTING
ENVIRONMENT", and naming Hari Ravichandran as inventor,
and U.S. application Ser. No. 08/864,247, filed May 28, 1997,
entitled "METHOD AND APPARATUS FOR GENERATING AN
OPTIMIZED TARGET EXECUTABLE COMPUTER PROGRAM
USING AN OPTIMIZED SOURCE EXECUTABLE", and naming
Hari Ravichandran as inventor, both of which are assigned to
the assignee of the present invention and are herein
incorporated, in their entirety, by reference.

FIELD OF THE INVENTION

The present invention relates to computer compilers and 
interpreters. In particular, this invention relates to a tech- 
nique for optimizing an executable computer program using 
input data. 

BACKGROUND OF THE INVENTION 

In most cases, programmers write computer applications
in high level languages, such as JAVA¹ or C++, and rely on
sophisticated optimizing compilers to convert the high level
code into an efficient executable computer program.
Optimizing compilers have replaced the time consuming and
commercially unfeasible task of handcoding applications in
low level assembly or machine code for maximum efficiency.
Instead, an optimizer portion of the compiler performs
transformations on the code designed to improve execution
times and utilize computer resources more efficiently.
Accordingly, some of the code transformations are based on
the structure of the code and not related to the target
processor used for execution. Other types of transformations
improve the code by utilizing specific features available on
the target processor such as registers, pipeline structures
and other hardware features.

1. The Network Is the Computer, Sun, the Sun logo, Sun Microsystems,
Solaris, Ultra and Java are trademarks or registered trademarks of Sun
Microsystems, Inc. in the United States and in other countries. All SPARC
trademarks are used under license and are trademarks or registered
trademarks of SPARC International, Inc. in the United States and other
countries. Products bearing SPARC trademarks are based upon an
architecture developed by Sun Microsystems, Inc. UNIX is a registered
trademark in the United States and other countries exclusively licensed
through X/Open Company, Ltd.

Unfortunately, most optimizers available today only estimate
which portions of the program will benefit most from
the optimization routines before actual execution. These 
optimizers analyze the application code in a static state and 
without any input data. Graphing theory and complex data 
flow analysis are used at compile time to determine which 
portions of the code might be executed most frequently at 
execution time. If these determinations are not correct, 
frequently executed portions of the code or "hot spots" will 
not be optimized and the code will run less efficiently. Also, 
if the application is unusual or behaves in an atypical manner 
these techniques may also fail to optimize the code effec- 
tively. 

The current state of optimizing compilers must be
significantly improved as the compiler becomes a more integral
part of the processor performance. On many pipelined
multiple issue processor architectures, the performance of
hardware based instruction schedulers has been enhanced
using sophisticated optimizing compilers which generate an
instruction stream which further exploits the underlying
hardware. Unfortunately, these compilers show limited
results optimizing computer programs using the current
static optimization techniques. In practice, the typical
optimizer has difficulty fully optimizing a computer program
because the input data and actual run-time characteristics of
the computer program are unknown at compile time.
Consequently, the optimizer may only increase the performance
of a computer program marginally. For example, a
very-long-instruction-word (VLIW) computer relies primarily on
an optimizing compiler to generate binaries capable of
exploiting the VLIW processor's multiple issue capabilities.
If the run-time characteristics of a particular computer
program are complex, the VLIW optimizing compiler will
not be able to predict the run-time characteristics of the
program and exploit all the processor's features adequately.

Optimizers are also important when comparing processors
based on industry standard benchmarks such as SPECint95,
SPECfp95, and TPCC/CB. These standard benchmarks are
used to compare different processors based upon how quickly
each processor executes the given benchmark program. An
inherently powerful processor can appear relatively slow if
the optimizing compiler does not optimize the code properly.
As a result, the apparent throughput on a processor may be
significantly less than a computer manufacturer expects.

What is needed is a dynamic optimizing compiler which
uses input data to profile a computer application in a 
systematic manner suitable for producing a highly optimized 
executable. These techniques can be used to optimize 
executables on a wide variety of platforms. 

SUMMARY OF THE INVENTION 

According to principles of the invention, a method and
apparatus for using input data to optimize a computer
program for execution on a target computer is provided.
Initially, the computer program is divided into one or more
logical units of code. Next, a CPU simulator is used to
simulate execution of each logical unit using the input data.
The output from the simulation is used to generate a first
optimization metric value and corresponding state
information for each logical unit. In one embodiment, the first
optimization metric value and corresponding state
information are stored in a first optimization vector. Using well
known optimization techniques, the instructions within each
logical unit are optimized iteratively using the first
optimization metric value and corresponding state information.
The iterations continue until additional optimizations would
result in very small incremental performance improvements
and a diminishing return in performance compared with
processing expended. A second simulation is performed
using the same input data except that this time the optimized
logical units are used. This second simulation is used to
measure how much the optimizer has improved the code.
The output from the second simulation is used to generate a
second optimization metric value and corresponding state
information. The degree of optimization is determined by
determining the difference between the first optimization
metric value and the second optimization metric value. If the
difference is less than a predetermined threshold value,
additional optimization iterations would provide little code
improvement and thus the optimization is complete.
However, if the difference is greater than or equal to the
predetermined threshold value, additional optimizations
would likely improve performance. In the latter case, the
present invention would repeat the optimization process
described above.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a computer network for practicing one
embodiment of the present invention.

FIG. 2 is a flowchart illustrating the overall processing
performed by an optimizing compiler designed in accordance
with one embodiment of the present invention.

FIG. 3 is a block diagram which illustrates the parameter
values stored in an optimization vector.

FIG. 4 is a flowchart illustrating the detailed steps
associated with an optimizer designed in accordance with the
present invention.

FIG. 5 is a flowchart which illustrates in more detail the
steps used to perform optimizations on a basic block of
instructions.

DETAILED DESCRIPTION

FIG. 1 illustrates a computer network 100 for practicing
one embodiment of the present invention. Computer network
100 includes server computer systems 102 and 104
configured to communicate with a client computer system
106 over a network 110. Preferably, the client and server
computer systems coupled to this network transmit
information utilizing the TCP/IP protocol. Other network
protocols such as SNA, X.25, Novell Netware¹, Vines, or
AppleTalk could also be used to provide similar client-server
communication capabilities.

1. Netware is a registered trademark of Novell, Inc. in the United States and
other countries.

Server 102 includes a network interface 112, a processor
114, a primary storage 116, a secondary storage 118, and an
I/O (input output) interface 120 which facilitates
communication between these aforementioned elements. Network
interface 112 couples server 102 to network 110 and
facilitates communication between server 102 and other
computers on the network.

Typically, processor 114 on server 102 fetches computer
instructions from primary storage 116 through I/O interface
120. After retrieving these instructions, processor 114
executes these computer instructions. Executing these
computer instructions enables processor 114 to retrieve data or
write data to primary storage 116, secondary storage 118,
display information on one or more computer display
devices (not shown), receive command signals from one or
more input devices (not shown), or retrieve data or write
data to other computer systems coupled to network 110 such
as server 104, and client 106. Those skilled in the art will
also understand that primary storage 116 and secondary
storage 118 can include any type of computer storage
including, without limitation, randomly accessible memory
(RAM), read-only-memory (ROM), application specific
integrated circuits (ASIC) and storage devices which include
magnetic and optical storage media such as CD-ROM. In
one embodiment, processor 114 can be any of the SPARC
compatible processors, UltraSPARC compatible processors,
or Java compatible processors available from Sun
Microsystems, Inc. of Mountain View, Calif. Alternatively,
processor 114 can be based on the PowerPC processor available
from Apple, Inc. of Cupertino, Calif., or any of the Pentium
or x86 compatible processors available from the Intel
Corporation or other corporations such as AMD, and Cyrix.

Primary storage 116 includes an operating system 122 for
managing computer resources. In one embodiment, this
operating system is the Solaris operating system or any other
multitasking, multiuser operating system with support for
object oriented programming languages such as the Java
programming language or high level programming
languages such as C. Also included in primary storage 116 is a
source code 124, such as the source code of a Java
application, and an optimizing compiler 126 for generating an
executable computer program 128.

FIG. 2 is a flowchart illustrating the overall processing
performed by an optimizing compiler 126 designed in
accordance with one embodiment of the present invention.
Optimizing compiler 126 typically contains a front end 202,
a code generator 204, an optimizer 206 and a backend 208.
First, the source code for a computer program is generated
by a user and provided to front end 202 of the compiler
where various pre-processing functions are performed.
Next, the code is provided to the code generator 204 which
generates a set of instructions expressed in an intermediate
code which is semantically equivalent to the source code.
Typically, the intermediate code is expressed in a
machine-independent format.

In accordance with one embodiment of the present
invention, optimizer 206 accepts this intermediate instruction
set and performs various transformations to schedule
the instruction set in a faster and more efficient manner.
Some of these optimizations are concerned with improving
the logic of the code and some of the optimizations are
concerned with improving the code based upon the target
processor used to execute the code. Details on one
embodiment for implementing the present invention are discussed
in further detail below.

Next, backend 208 accepts the optimized intermediate
code and generates a target executable 210 which includes
a set of machine instructions in binary format which can be
executed on a specific target machine such as SPARC, Intel,
PowerPC, or MIPS. Each machine instruction includes an
operation code (opcode) portion and an operand portion
containing one or more operands. The opcode portion of the
machine instruction instructs the target machine to execute
specific functions. The operand portion of the instruction is
used to locate data stored in a combination of registers or
memory available during execution.

Embodiments of the present invention provide a novel
technique for optimizing a computer program using a CPU
simulator and input data. A series of optimization vectors are
used to store information and drive the optimization process.
Accordingly, a brief discussion of the optimization vector and
the parameters stored within it is useful in understanding the
present invention and the overall optimization process.

Referring to FIG. 3, a block diagram indicates the typical
optimization metrics stored in an optimization vector 300
and associated with each basic block. First, a basic block
identifier 302 is stored in optimization vector 300 to quickly
identify the basic block currently being optimized. In one
embodiment of the present invention, the optimization
vector information is stored as a linked list. Basic block
identifiers 302 in this linked list contain a value generated
using the memory offsets of each basic block as an input
value to a quadratic hashing function. By hashing values in
this manner, each basic block is located more quickly and
accurately than possible in non-hashed search techniques.
As an alternative to a linked list, the optimization vectors
could be stored in a large hashing table having N entries.

In another portion of optimization vector 300, frequency
index 304 provides a metric indicating how many
instructions are executed within the particular basic block. A large
value in frequency index 304 typically means the associated
basic block contains a "hot spot" within the code and should
be the focus of the optimization process. Those skilled in the
art will understand that the frequency index 304 metric is
typically larger than the number of instructions in a basic
block because many of the instructions are executed
repeatedly. In one embodiment, each basic block is optimized in
descending order based upon frequency index 304.
Typically, basic block identifier 302 is used to organize the
sequence in which each basic block is analyzed and
optimized. Unlike prior solutions, the present invention uses a
CPU simulator to determine which areas of the program
actually are executed most frequently. Consequently, the
present invention operates more reliably and more efficiently
than prior art systems.

Referring to FIG. 3, a live register list 306 provides the list
of live registers and address registers used in the current
version of the optimized basic block. Those skilled in the art
understand that address registers are typically used in
processors where load instructions can only use registers and
cannot access memory directly. The address registers are
typically reserved for load instructions. Accordingly,
keeping a list of the live registers and address registers indicates
how efficiently the registers on the processor are being used.
It also provides the optimizer with an indication of how
much register pressure the current basic block contributes to
the overall execution profile. Optimization techniques
concerned with utilizing registers more efficiently will use live
register list 306 during the optimization process.

A code stability indicator 308 in optimization vector 300
indicates whether local code optimization schemes have
been exhausted and no more code movement is going to
occur. Before optimization schemes are applied, the initial
value associated with stability indicator 308 indicates that
the code is unstable and that additional code optimizations
can be made to the basic block. At this stage, the code is
considered unstable because each local optimization
typically rearranges or modifies the existing code. Eventually,
when no more local code optimizations can be made, code
stability indicator 308 indicates that the code is stable and
that local optimizations for the particular basic block have
been exhausted. Details on determining when code in the
basic block is unstable or stable are discussed in further detail
below.

In yet another portion of optimization vector 300, a state
definition 310 is used to store information concerning the
state of the code in a given basic block. This state
information typically includes detailed information on the registers
used in the basic block and the dependencies on these
registers between instructions and between different basic
blocks. The state information in state definition 310 is
typically updated with modifications to the code each time
an optimization transformation is applied to a particular
basic block. In one embodiment, a pointer in state definition
310 indicates where the state information is located
elsewhere in memory. Alternatively, state information could
actually be contained within the area reserved for state
definition 310. Details on the actual state information
associated with state definition 310 are discussed in further detail
below.

Another portion of optimization vector 300 is a cross
block optimization option 312 which is used to drive
optimizations which take place between basic blocks of
instructions within the program. Accordingly, cross block optimizer
option 312 includes a parameter which determines how
many basic blocks outside the current basic block should be
analyzed for cross block types of optimizations.

The next portion of optimization vector 300 is an
optimization metric value 314 used to store an optimization metric
value associated with a particular basic block. In one
embodiment, the optimization metric value 314 is the
weighted product of the number of instructions executed in
the basic block and the clock cycles per instruction (CPI)
used to execute these instructions. Because the CPI is
generated for each basic block, analysis is performed at a
relatively high degree of granularity which in turn provides
more accurate results. The weight attributed to these two
parameters (number of instructions executed in a basic block
and CPI) is user defined and depends on the empirical results
generated by the specific application or use. These two
values are obtained primarily from simulation results
performed by a CPU simulator. This value is important in
determining the incremental improvement obtained in the
most recent optimization and moreover when the
optimization process is complete. Details on how the optimization
metric value is utilized will be discussed in their entirety later
herein.

Referring to FIG. 4, a flowchart diagram of optimizer 206
(FIG. 2) provides detailed steps associated with one
embodiment of the present invention. Initially, this process assumes
that a compiler has generated an initial set of architecturally
neutral instructions which are similar to the underlying
instructions used by the CPU used during execution.
Alternatively, these initial instructions could also be a set of
instructions which could be executed directly on the
underlying CPU. Accordingly, the initial instruction set being
processed could contain either the former architecturally
neutral or the latter architecturally specific instruction types.

Referring to FIG. 4, at step 402 a cycle accurate CPU
simulator receives an initial set of instructions from code
generator 204 (FIG. 2). These instructions are generally
divided into one or more logical units of code to simplify the
optimization process later on. In one embodiment, these
logical units are basic blocks of instructions having only one
entrance instruction and one exit instruction. The cycle
accurate CPU simulator uses the input data to simulate the
execution of these instructions on a target processor. The
simulation indicates which instructions will not be executed
and the frequency at which the instructions are executed
given the particular input data. Moreover, the simulation
results also indicate how much time or instruction cycles
each instruction will take to execute. Unlike other
optimizers, the present invention generates actual execution
information based upon specific input information as it relates to
a particular processor. Knowing which instructions are
actually executed in a program improves data flow analysis and
the optimization process as a whole. Information on
generating a CPU simulator is discussed in "Execution Driven
Simulation of a Superscalar Processor", H. A. Rizvi, et al.,
Rice University, 27th Annual Hawaii International
Conference on System Sciences, 1994 and "Performance
Estimation of Multistreamed, Superscalar Processors", Wayne
Yamamoto, et al., University of Santa Barbara, 27th Annual
Hawaii International Conference on System Sciences, 1994.

At step 404 (FIG. 4) global code dependencies are
determined and stored in a state table for later reference during
the optimization process. Typically, each basic block within
the program has an entry in the state table which stores
information on the code in a particular basic block. For
example, the state table includes extensive information on
the registers used within the basic block. This includes
whether certain registers within the basic block are "dead"
or "live" and the register pressure associated with the
operations within the basic block. Those skilled in the art
will understand that a "live" register has stored a value
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which is going to be used by a subsequent instruction while determined by taking the difference between the first opti- 

'*dead" indicates that a particular register is no longer mization metric value and the second optimization metric 

required by a subsequent register in the basic block. value. 

The Slate table entries generated at step 404 also include Next, decision step 410 determines if the current com- 

the stale of dependencies of instructions between basic 5 puler program has been optimized most eflBciently and the 

blocks. In one embodiment, these dependencies are deter- optimization process is complete. This step also uses a 

mined using a variation on the Tomosula Algorithm for convergence technique similar to the one discussed above 

determining instruction dependencies as described in "Com- with respect to basic block optimizations. In particular, 

puter Architecture: Pipeline and Processor Design", Michael decision step 410 determines if another iteration of optimi- 

Flynn, Ch. 7, Jones and Bartlett Publishers, 1995 and is lO zation should be performed on the computer program by 

herein incorporated by reference in the entirety. Accord- analyzing the change optimization metric values as a result 

ingly, these dependencies are categorized within the state of the most recent optimizations. As previously discussed, 

table as being either essential, order, or output. this information is typically located in the optimization 

Further, the statnable~ehtries^ generated^aTstep 404falso^ vector associated with each basic block. If the change in the 

^include. inform atibnxn external or interrialTeferences made^ 15 optimization metric value is smaller than the predetermined 

within the basic block to procedures or objects. The internal minimum threshold, processing transfers from decision step 

references refers to a portion of code withiii the current 410 to step 414 because additional optimization iterations 

object while an external reference refers to a portion of code would not likely produce significant results. In accordance 

not within the current object not being analyzed and thus with the present invention, this would indicate that the 

requires linking or loading a library or external object. In 20 optimization for the program is complete. Alternatively, if 

most cases, not loading or linking the externally or internally the change in optimization metric value is greater than or 

referenced code is not fatal to the optimization process but equal to a predetermined minimum threshold, additional 

will reduce the accuracy of the process and potentially the optimizations may increase the performance of the computer 

degree of optimization. Those skilled in the art understand program and justify the additional intervening processing, 

that the above enUies included in the slate table are only 25 Accordingly, processing would transfer to step 412 where 

exemplary and should not be construed to limit the amount the original and optimized optimization vectors would be 

of information stored. In particular, the state table could swapped. Processing would then transfer back to step 404 

potentially include any and all information derived from a where state information and optimization vector information 

CPU simulator which simulates the execution of a computer is generated. Steps 402-410 are repeated until the optimi- 

program given input data. ' 30 zation criteria discussed above is met. 

Step 404 also generates an optimization vector, such as Referring to FIG. 5, a flowchart provides a more detailed 

optimization vector 300 in FIG. 3, for each basic block description of the steps used to perform local optimizations 

which stores a number of optimization parameters used on a given basic block of instructions at step 406 (FIG. 4). 

iteratively during the optimization process. One element in Essentially, each basic block in a computer program is 

the optimization vector includes the optimization metric 35 optimized individually before the computer program is 

parameter. As previously mentioned, this parameter is the profiled using the CPU simulator. Initially, each basic block 

weighted product of the number of instructions executed in is assumed to be in an "unstable" state since optimizations 

the basic block and the clock cycles per instruction (CPI) or transformations on the code invariably delete or rearrange 

used to execute these iastructioas. It is an important param- the basic block instructions. To indicate this, the initial 

eter in evaluating the efficiency of each optimization itera- 40 optimization vector for each basic block indicates the basic 

tion. Potentially, the optimization metric parameter and block is unstable. When the optimization on the basic block 

other parameters change as each basic block is optimized, is complete, the stability flag indicates a "stable" state. 

Next, processing transfers from step 404 to step 406 In FIG. 5, step 502 performs a group of optimization 

where optimizations of each basic block are performed using techniques well known in the art to the basic block code 

the CPU simulator output information. The most recent 45 section. Parameter information contained with the optimi- 

optimizaiion vector is used to optimize the current execut- zation vector drives these optimization transformations on 

able computer program and achieve a new associated execu- each basic block. In one embodiment, the optimization 

tion efficiency. Typical optimizations include: invariance
optimization; redundant code optimizations; global code
motion; local code motion; loop unrolling; basic block
boundary detection; and "dead code" removal optimizations.

The local optimizations are repeated on each basic block
until a predetermined level of optimization is converged
upon. Methods associated with determining this convergence
point are discussed in more detail below with respect
to FIG. 5. For information on optimization techniques, see
"Compilers: Principles, Techniques, and Tools", Chapter 10,
Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman,
Addison-Wesley, 1988.

At step 408, a second simulation is performed using CPU
simulator 408 and the same input data except that this time
the optimized basic blocks of code are used. This second
simulation is used to measure how much the optimizer has
improved the code. In one embodiment, the output from the
second simulation at step 408 is used to generate a second
optimization metric value and corresponding state information
for each basic block. The degree of optimization is

techniques include invariance optimization, redundant code
optimizations, global code motion, local code motion, loop
unrolling, basic block boundary detection, and "dead code"
removal optimizations.

Ideally, each transformation deletes or moves code which
results in reducing the CPI (clock cycles per instruction)
and overall execution time. Step 504 modifies parameters in
the optimization vector to reflect changes due to these
optimization transformations. Typically, this involves reducing
the optimization metric value by some amount with each
iteration.

Referring to FIG. 5, decision step 504 determines if
another iteration of optimization should be performed on the
basic block by analyzing the estimated change in CPI and
reduction in execution time as a result of the most recent
optimizations. If the change in CPI is greater than or equal
to a predetermined minimum threshold, additional
optimizations may increase the performance of this particular basic
block and justify the additional intervening processing. In
this case, the basic block remains unstable and the
corresponding stability entry in the optimization vector is not
changed. Further, additional optimizations are performed on 
this basic block. Alternatively, if the change in CPI is smaller 
than the predetermined minimum threshold, processing 
transfers from decision step 504 to step 506 because
additional optimization iterations would not likely produce
significant results. In the latter case, the basic block is deemed
stable and the corresponding stability entry in the
optimization vector is set to indicate such a condition. This would
also indicate that optimization for the particular basic block
is complete. 
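
The per-block stability decision described above can be sketched roughly as follows. This is a minimal illustration only, not the patented implementation: the `BlockEntry` record, the `optimize_block` function, and the toy pass are hypothetical names, and each optimization pass is modeled simply as a function that maps the current estimated CPI to a new estimated CPI.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BlockEntry:
    """Hypothetical optimization-vector entry for one basic block."""
    cpi: float            # estimated clock cycles per instruction
    stable: bool = False  # stability entry; True once the block has converged


def optimize_block(entry: BlockEntry,
                   passes: List[Callable[[float], float]],
                   min_delta: float) -> int:
    """Repeat the local passes until the CPI change drops below min_delta.

    Returns the number of iterations performed.
    """
    iterations = 0
    while not entry.stable:
        before = entry.cpi
        for apply_pass in passes:
            entry.cpi = apply_pass(entry.cpi)
        iterations += 1
        # Change smaller than the minimum threshold: mark the block stable.
        if before - entry.cpi < min_delta:
            entry.stable = True
    return iterations


def toy_pass(cpi: float) -> float:
    """Assumed stand-in for a real transformation: removes 20% of the CPI above 1.0."""
    return 1.0 + (cpi - 1.0) * 0.8


entry = BlockEntry(cpi=2.0)
iterations = optimize_block(entry, [toy_pass], min_delta=0.05)
print(iterations, round(entry.cpi, 4), entry.stable)
```

Because the toy pass yields diminishing returns, the loop ends once a further iteration would change the CPI by less than `min_delta`, which mirrors the transfer from the decision step to the next step.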

Processing then transfers from decision step 506 to step 
508 where cross block optimizations are performed to take 
advantage of specific instructions located in other basic 
blocks in the computer program. In one embodiment, cross
block optimizations are useful in a superscalar architecture 
computer, such as SPARC, where two instructions have a 
long execution latency and a third independent instruction 
can be executed between them. Cross block optimization 
heuristics search through a number of adjacent basic blocks
for instructions which can be executed out-of-order between 
instructions with a long execution latency. Unlike hardware 
based techniques for performing out-of-order execution, this 
method does not require special on-chip buffers to process 
more basic blocks and to re-order instruction results once
they complete. Instead, the present invention performs all 
these calculations in software. 
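
One way to picture the cross block search is the sketch below. It assumes a simplified instruction model (`Instr`, with one destination register and a tuple of source registers) and checks register dependencies only against the long-latency pair. The names `Instr`, `independent`, and `find_filler` are illustrative assumptions, and a real scheduler would also have to respect control flow and dependencies on the instructions the candidate is moved past.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this illustration."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()
    latency: int = 1


def independent(cand: Instr, other: Instr) -> bool:
    """True if cand has no register dependency (RAW, WAR or WAW) on other."""
    if cand.dest is not None and (cand.dest == other.dest or cand.dest in other.srcs):
        return False  # cand would overwrite a register other writes or reads
    if other.dest is not None and other.dest in cand.srcs:
        return False  # cand reads a register other writes
    return True


def find_filler(producer: Instr, consumer: Instr,
                neighbor_blocks: List[List[Instr]]) -> Optional[Instr]:
    """Search adjacent blocks for an instruction that could run between the pair."""
    for block in neighbor_blocks:
        for instr in block:
            if independent(instr, producer) and independent(instr, consumer):
                return instr
    return None


load = Instr("ld", "r1", ("r2",), latency=3)   # long-latency producer
use = Instr("add", "r3", ("r1", "r4"))         # consumer waiting on r1
next_block = [
    Instr("sub", "r5", ("r1", "r6")),          # reads r1, so it is not independent
    Instr("mul", "r7", ("r8", "r9")),          # touches unrelated registers
]
filler = find_filler(load, use, [next_block])
print(filler)
```

Here the `mul` instruction is selected because it neither reads nor writes any register used by the load or the add, so it can occupy a cycle that would otherwise be spent waiting on the load.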

The present invention offers several advantages in com- 
piler optimization which were previously unavailable in the 
art. Unlike existing techniques, each optimization vector is
generated using actual input data for each executable
computer program. In the past, the user estimated the run-time
characteristics of the program at compile time and provided 
a single optimization vector used for each basic block in the 
program. Instead, the present invention actually simulates
the runtime characteristics of the program before run time. 
This simplifies data flow analysis and makes most
optimization techniques more efficient.

Another advantage of the present invention is the relative
ease with which a user can generate an optimized
executable. In the past, the user was required to select a 
predetermined optimization vector for use by the compiler. 
This often required the user to analyze the type of code being 
used in the computer program and also adjust various input 
parameters to the optimizer accordingly. Using the present
invention, the CPU simulator provides an initial optimization
vector and then iteratively modifies it based on simulations
driven by the input data. This makes the present invention
accurate and yet obviates the need for complex decision 
making initiated by the user.

This technique is also advantageous because the optimi- 
zation techniques used by the optimizer automatically shift 
with the type of executable program being executed. In the 
past, some optimizers performed transformations found
statistically useful in a particular class of executable programs
(e.g., graphics intensive, floating point intensive, highly
parallelized). Often, existing optimizers would have limited
improvement in a computer program which was unusual or 
could not be easily categorized. The present invention 
overcomes these limitations by using data flow information
generated specifically for the current computer program. For 
example, embodiments of the present invention use a feed- 
back loop to the CPU simulator to generate actual data flow 
information associated with the specific computer program. 
These results are used to drive the optimization process
iteratively until the most optimal binary is generated. 
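
The feedback loop between the optimizer and the CPU simulator can be sketched as shown below. This is a rough model under stated assumptions rather than the actual system: the program is reduced to one `(instructions_executed, cpi)` pair per basic block, `simulate` and `optimize` are hypothetical stand-ins for the CPU simulator and the optimizer, and the metric is taken to be total cycles summed over the blocks.

```python
from typing import List, Tuple

# Hypothetical program model: one (instructions_executed, cpi) pair per basic block.
Program = List[Tuple[int, float]]


def simulate(program: Program) -> float:
    """Stand-in for the CPU simulator: total cycles summed over all blocks."""
    return sum(count * cpi for count, cpi in program)


def optimize(program: Program) -> Program:
    """Stand-in optimizer: moves each block's CPI halfway toward 1.0."""
    return [(count, 1.0 + (cpi - 1.0) / 2) for count, cpi in program]


def feedback_optimize(program: Program, threshold: float) -> Tuple[Program, float, int]:
    """Simulate, optimize, re-simulate, and repeat while the gain stays >= threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive so the loop can terminate")
    metric = simulate(program)            # first optimization metric value
    rounds = 0
    while True:
        candidate = optimize(program)
        new_metric = simulate(candidate)  # second optimization metric value
        rounds += 1
        difference = metric - new_metric
        if difference > 0:                # keep the candidate only if it is better
            program, metric = candidate, new_metric
        if difference < threshold:        # little left to gain: optimization complete
            return program, metric, rounds


best, final_metric, rounds = feedback_optimize([(100, 2.0), (50, 3.0)], threshold=10.0)
print(rounds, final_metric)
```

The loop compares the metric from before and after each optimization round and stops as soon as the difference falls below the threshold, so the number of rounds adapts to how much the particular program and input data still have to gain.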

Naturally, the present invention is extremely useful in 
generating high performance benchmarks for a particular
target processor. The techniques in the present invention are 
designed to tailor an application and specific data input 
values to a particular target processor using empirical data
generated from a CPU simulator. Computer manufacturers 
and others using benchmarks can use the present invention 
to generate faster benchmarks and thus tout faster through- 
put times on their products. 

While specific embodiments have been described herein 
for purposes of illustration, various modifications may be 
made without departing from the spirit and scope of the 
invention. Those skilled in the art understand that the present 
invention can be implemented in a wide variety of compiler 
and interpreter technologies and is not limited to computer 
systems executing the compilers used for the SPARC archi- 
tecture. Alternative embodiments substantially similar to the 
preferred embodiment could be implemented except that the 
compilers are used to generate Java Bytecodes for the Java 
Virtual Machine, executables for the Java line of processors 
such as the PicoJava, NanoJava, MicroJava, and UltraJava 
architectures or the PowerPC processor available from 
Apple, Inc. of Cupertino, Calif., or any of the Pentium or x86 
compatible processors available from the Intel Corporation 
or other corporations such as AMD and Cyrix. Further,
those skilled in the art understand that results generated by 
iteratively running a set of executable instructions through 
the CPU simulator could also be used for modifying the 
architectural subsystems in the processor such as memory, 
cache, specialized instructions, and general purpose instructions.
Furthermore, another alternative embodiment substantially 
similar to the preferred embodiment could be implemented 
except that the convergence on a particular optimization 
vector is limited by a predetermined number of feedback 
iterations rather than a predetermined threshold level. 
Accordingly, the invention is not limited to the above 
described embodiments, but instead is defined by the
appended claims in light of their full scope of equivalents. 

What is claimed is: 

1. A method for using input data to optimize a computer 
program for execution on a target computer, the method 
comprising the steps of: 

dividing the computer program into one or more logical 
units of code; 

simulating execution of each logical unit using the input 
data; 

generating a first optimization metric value and corre- 
sponding state information for each logical unit based 
upon the corresponding simulation; 

optimizing the instructions within each logical unit 
according to the corresponding state information pre- 
viously generated; 

simulating execution of each optimized logical unit using 
the input data; 

generating a second optimization metric value and corre- 
sponding state information for each logical unit based 
upon the corresponding simulation; 

determining the difference between the first optimization 
metric value and the second optimization metric value 
and, 

if the difference is less than a predetermined threshold 
value, indicating that the optimization is complete, 

if the difference is greater than or equal to the prede- 
termined threshold value, repeating the steps above 
except replace the computer program with the opti- 
mized computer program. 

2. The method of claim 1 wherein the step of optimizing
the instructions within each logical unit further comprises
the steps of:

performing one or more optimization transformations on
the code in each logical unit;

adjusting the optimization metric to reflect changes in the
code such as code removal or simplification of code
logic;

determining the change in the optimization metric and,

if the change in the optimization metric is greater than or
equal to a predetermined threshold level, repeating the
above steps except using the optimized logical unit in
lieu of the logical unit,

if the change in the optimization metric is less than the
predetermined threshold level, indicating the logical
unit is optimized.

3. The method of claim 2 wherein the optimization 
transformations on each logical unit includes invariance 
optimizations. 

4. The method of claim 2 wherein the optimization 
transformations on each logical unit includes local code 
motion. 

5. The method of claim 2 wherein the optimization 
transformations on each logical unit includes dead code 
removal optimizations. 

6. The method of claim 1 wherein the computer program 
and input data is provided for processing before execution 
on the target computer. 

7. The method of claim 1 wherein the logical units are
basic blocks of instructions having only one entrance 
instruction and one exit instruction. 

8. The method of claim 1 wherein the first and second 
optimization metric values for each logical unit includes a 
weighted product of the number of instructions executed in 
the basic block and the clock cycles per instruction (CPI) 
used to execute these instructions. 

9. The method of claim 1 wherein the state information 
includes register usage information, dependencies between 
instructions, and external references to other computer pro- 
grams. 

10. A compiler having an optimizer which uses input data 
to optimize a computer program for execution on a target 
computer, comprising: 

a division mechanism configured to divide the computer 
program into one or more logical units of code; 

a simulation mechanism configured to simulate execution 
of instructions in each logical unit using the input data; 

a first generation mechanism configured to generate a first
optimization metric value and corresponding first state 
information for each logical unit based upon a corre- 
sponding simulation of each logical unit; 

an optimization mechanism configured to optimize the 
instructions within each logical unit according to the 
corresponding first state information previously gener- 
ated; 

a second generation mechanism configured to generate a 
second optimization metric value and corresponding 
second state information for each optimized logical unit 
based upon a corresponding simulation of each opti- 
mized logical unit; 

a comparison mechanism configured to compare the dif- 
ference between the first optimization metric value and 
the second optimization metric value and, 

a first indicator mechanism coupled that the optimization 
is complete if the difference is less than a predeter- 
mined threshold value, or 

repeat the steps above except replace the computer
program with the optimized computer program if the
difference is greater than or equal to a predetermined
threshold value.

11. The compiler of claim 10 wherein the mechanism 
configured to optimize the instructions within each logical 
unit further comprises, 

a mechanism configured to perform one or more
optimization transformations on the code in each logical unit;

a mechanism configured to adjust the optimization metric 
to reflect changes in the code such as code removal or 
simplification of code logic; 

a mechanism configured to determine the change in the 
optimization metric and, 

provide the optimized logical unit in lieu of the logical 
unit, if the change in the optimization metric is greater 
than or equal to a predetermined threshold level, or 

provide an indicator that the logical unit is optimized, if 
the change in the optimization metric is less than the 
predetermined threshold level. 

12. The compiler of claim 11 wherein the optimization 
transformations on each logical unit includes invariance 
optimizations. 

13. The compiler of claim 11 wherein the optimization 
transformations on each logical unit includes local code 
motion. 

14. The compiler of claim 11 wherein the optimization 
transformations on each logical unit includes dead code 
removal optimizations. 

15. The compiler of claim 10 wherein the computer 
program and input data is provided for processing before 
execution on the target computer. 

16. The compiler of claim 10 wherein the logical units are 
basic blocks of instructions having only one entrance 
instruction and one exit instruction. 

17. The compiler of claim 10 wherein the first and second 
optimization metric values for each logical unit includes a 
weighted product of the number of instructions executed in
the basic block and the clock cycles per instruction (CPI) 
used to execute these instructions. 

18. The compiler of claim 10 wherein the state informa- 
tion includes register usage information, dependencies 
between instructions, and external references to other com- 
puter programs. 

19. A computer program product comprising: 

a computer usable medium having computer readable 
code embodied therein which uses input data to opti- 
mize a computer program for execution on a target 
computer comprising: 

a first code portion configured to divide the computer 
program into one or more logical units of code; 

a second code portion configured to simulate execution 
of each logical unit using the input data; 

a third code portion configured to generate a first 
optimization metric value and corresponding state 
information for each logical unit based upon the 
corresponding simulation; 

a fourth code portion configured to optimize the 
instructions within each logical unit according to the 
corresponding state information previously gener- 
ated; 

a fifth code portion configured to simulate execution of 
each optimized logical unit using the input data; 

a sixth code portion configured to generate a second 
optimization metric value and corresponding state 
information for each logical unit based upon the 
corresponding simulation; 

a seventh code portion configured to determine the
difference between the first optimization metric
value and the second optimization metric value and,

provide an indicator that the optimization is complete
if the difference is less than a predetermined
threshold value, or

repeat the steps above except replace the computer
program with the optimized computer program if
the difference is greater than or equal to a
predetermined threshold value.

20. The code in claim 19 further comprising:

an eighth code portion configured to perform one or more
optimization transformations on the code in each
logical unit;

a ninth code portion configured to adjust the optimization
metric to reflect changes in the code such as code
removal or simplification of code logic;

a tenth code portion configured to determine the change in 
the optimization metric and, 

provide a first indicator that the optimized logical unit 
should be used in lieu of the logical unit, if the change 
in the optimization metric is greater than or equal to a 
predetermined threshold level, or 

provide a second indicator that the logical unit is 
optimized, if the change in the optimization metric is 
less than the predetermined threshold level. 

***** 